1 Overview

After you have completed running the cyclone pipeline, there will be a number of useful outputs, including:

  • ‘batch_qc_plots.pdf’
  • ‘feature_plots.png’
  • ‘plots.pdf’
  • ‘split_umap_by_clustering.png’
  • various checkpoint_<#>.RData files.

There are many directions that one can follow up with just these outputs.

  • Batch effect assessment: ‘batch_qc_plots.pdf’ contains 2 umap plots (one colored by and one faceted by the “pool_id” batch information from your file_metadata) as well as numerous heatmaps that either show data by batch directly, or are annotated with batch information. These plots can be useful for assessing if features of your data heavily associate with processing batch. Depending on your experimental design, such association may be expected, but when unexpectedly present, it can be a good idea to perform some form of batch correction. CytoNorm and cyCombine are two common pathways for batch correction of CyTOF data which our team has used.

  • Cluster Annotation: Now that your similar cells are grouped together into “clusters”, it’s time to figure out what each cluster represents. This can be a pretty manual process, but is one of the most important steps as annotating cluster biology. It lets us make sense of what differences in cluster frequencies (see below) between samples actually mean. The ‘split_umap_by_clustering.png’, ‘feature_plots.png’, and ‘plots.pdf’ outputs are all useful here. ‘split_umap_by_clustering.png’ and ‘feature_plots.png’ can be used together to determine cell type / lineage defining markers that each cluster expresses, or for a different view, the heatmaps contained within ‘plots.pdf’ show the median expression of each marker within each cluster.

There are also many other directions that your follow up analysis might take you – too may for us to attempt to build them all in to cyclone directly. Luckily, its easy to interface with other tools, like the visualization package dittoSeq.

1.1 Scope of this vignette

In this vignette, we will describe a few such directions. Specifically, we will start with direct cyclone outputs and take them through the follow up steps:

  1. Consolidate the data into a SingleCellExperiment object
  2. Add cell type annotations for our clusters
  3. Exploring the dataset with dittoSeq (a color vision deficiency & novice coder friendly visualization tool)
  4. Run statistics on differences in cluster and cell type frequencies between samples.

2 1. Importing into an SCE

Here, we make use of the SingleCellExperiment object structure because it holds everything we need for our cytometry analysis: space for expression matrices, per-cell metadata, dimensionality reductions like umap, and even per-marker metadata although we won’t be making use of that in this vignette.

2.1 Digression1: A mini summary of these SCE components

The ‘assays’ slot holds matrices in the form of features (rows) by cells (columns). They can be accessed with assay(, ) OR a default assay will be retrieved if with simply assay().

The ‘colData’ slot holds metadata for cells, and the ‘rowData’ slot holds metadata for markers. Here, we only make use of ‘colData’ but the slots work similarly. Their structure is a DataFrame where each rows holds data for a col (cell) or row (marker) of the object, and columns are the different bits of information about them. For example, we will have column in ‘colData’ holding the ‘cluster’ assignments of each cell. For convenience, the SCE maintainers set up this syntax <sce>$<colData_column> to pull directly from colData, which makes these data quite easy to access!

The ‘reducedDims’ slot holds dimensionality reductions, such as UMAP, as matrices. There are accessor functions for these as well, ?reducedDims

2.2 Actually loading in our data

We can make use of helper functions included in the cyclone package for this. Specifically, make_or_load_full_sce(). The main bits of the function are:

  1. Loading checkpoint1 before checkpoint8 because the cell_metadata element of the former is updated, with cluster mapping, in the latter.
  2. Generating the SCE with code equivalent to the lines below:
sce <- SingleCellExperiment(
    assays = list(transformed=t(trans_exp)),
    colData = DataFrame(cell_metadata[, !grepl("UMAP",colnames(cell_metadata))]),
    reducedDims = list(umap=cell_metadata[, grepl("UMAP",colnames(cell_metadata))]))

See ?make_or_load_full_sce for more details.

full_sce <- make_or_load_full_sce(
    # sce_file_name is not used in the first pass because we are neither
    #  loading nor saving, but in later passes this will load the saved version!
    sce_file_name = "cytof_full.Rds",
    # Make sure you update these paths to point toward where you have your own
    #  cyclone outputs saved!
    checkpoint1 = "checkpoint_1.RData",
    checkpoint8 = "checkpoint_8.RData",
    load_checkpoints = TRUE,
    # We won't save yet because we will be adding new columns to colData in the
    #   next step. We'll save after that!
    save = FALSE,
    make_clusters_factors = FALSE
)
## [ 2023-03-16 13:39:08 ] Loading Checkpoint 1
## Loading objects:
##   CHECKPOINT
##   raw_exp
##   trans_exp
##   cell_metadata
##   marker_metadata
##   file_metadata
##   param_list
## [ 2023-03-16 13:39:26 ] Loading Checkpoint 8
## Loading objects:
##   file_metadata
##   marker_metadata
##   cell_metadata
##   cluster_metadata
##   file_by_cluster_freq
##   file_by_cluster_freq_norm
##   file_by_cluster_median_exp
##   cluster_median_exp
##   file_median_exp
##   param_list
## [ 2023-03-16 13:39:29 ] Making SCE.
## [ 2023-03-16 13:39:31 ] Done.

Now let’s look at the summary of the object:

full_sce
## class: SingleCellExperiment 
## dim: 58 2049985 
## metadata(0):
## assays(1): transformed
## rownames(58): Time length ... ID_1 ID_2
## rowData names(0):
## colnames(2049985): cell_1 cell_2 ... cell_2049984 cell_2049985
## colData names(10): file_name pool_id ... cluster cluster_6x6
## reducedDimNames(1): umap
## mainExpName: NULL
## altExpNames(0):

3 2. Add cell type annotations for our clusters

The first step of any cluster-based analysis is making sense of what each cluster represents. cyclone’s direct outputs serve quite well for this purpose:

‘split_umap_by_clustering.png’ and ‘feature_plots.png’ can be used together to determine cell type / lineage defining markers that each cluster expresses, or for a different view, the heatmaps contained within ‘plots.pdf’ show the median expression of each marker within each cluster.

We recommend compiling your cluster assignments in Excel or other table/spreadsheet manipulation tool, then saving them as a .csv or .tsv. (There are also tools for loading directly from .xlsx files, but we won’t cover them here.) As long as you have one column containing the cell ids, and another holding cell type names, that’s enough! But generally, it’s useful to record both ‘coarse’ and ‘fine’-level annotations.

Here, we’ll assume this is done, and that the structure of the annotations.csv file is something like:

cluster coarse fine
1 CD4T CD4T_naive
2 NK NK_mature
3 Monocyte Monocyte_classical

3.1 Read in your annotations

annots <- read.csv("annotations.csv")
head(annots)
##   cluster coarse        fine
## 1       1   CD8T   CD8T_EMRA
## 2       2   CD8T   CD8T_EMRA
## 3       3     NK          NK
## 4       4     NK          NK
## 5       5      B plasmablast
## 6       6    ASC         ASC

3.2 Add annotations to the SCE

This process is simple because of R’s factor function! Factors are a useful data structure in R where “levels” (potential values) of the data are 1. pre-defined and 2. ordered. With the factor function, you can pick a vector to start with, set the levels and their order with the levels input, and then update what any of those levels are called with the labels input. Usefully, if any levels are given a matching label, they will be combined together. Thus, we can make use of this single function to achieve everything we need here – renaming from cluster_ids to annotations, combination of clusters given same annotations within the new factor!

We’ll make use of that function to create and pull in both depths of annotation by starting with the cluster metadata / colData.

head(full_sce$cluster)
## [1] "28" "35" "27" "33" "30" "4"

We’ll make a “coarse_annot” metadata / colData column for the coarse-level.

full_sce$coarse_annot <- factor(
    full_sce$cluster,
    levels = 1:max(as.numeric(unique(full_sce$cluster))),
    labels = annots$coarse
)

We’ll make a “fine_annot” metadata / colData column for the fine-level.

full_sce$fine_annot <- factor(
    full_sce$cluster,
    levels = 1:max(as.numeric(unique(full_sce$cluster))),
    labels = annots$fine
)

3.3 Saving the SCE with these annotations

Now, we can save our SCE to be able to skip re-running a lot of this code in the future, using cyclone’s save_sce helper function. (Alternatively, you can use R’s save or saveRDS function directly. This save_sce function is just a simple wrapper on top of saveRDS that automatically sets compress = TRUE for SCEs with more than a million cells.)

save_sce(full_sce, "cytof_full.Rds")

4 3. Exploring the dataset with dittoSeq

(a color vision deficiency & novice coder friendly visualization tool)

4.1 Pull in any additional metadata

Ideally, you will have added any sample-specific metadata into the sample_metadata that went into the cyclone pipeline at the start. Anything recorded within sample_metadata will have been copied over to the cell_metadata in the checkpoint_8.Rdata that we used for creating our SCE. Thus, all that metadata will already be accessible.

But sometimes we receive certain metadata only later. Such data will need to be mapped and loaded in. For our mock case, we’ll be loading in patient sex.

sex_meta <- read.csv("sex_metadata.csv")
head(sex_meta)
##           ID Sex
## 1 XVIR1-HS14   F
## 2 XVIR1-HS15   F
## 3 XVIR1-HS16   M
## 4 XVIR1-HS19   M
## 5 XVIR1-HS20   F
## 6 XVIR1-HS21   M

Now, we need to match the ‘ID’ column to our ‘donor_id’ column. Then we can map this sex info to our samples.

head(full_sce$donor_id)
## [1] "HS11" "HS11" "HS11" "HS11" "HS11" "HS11"
full_sce$sex <- sex_meta$Sex[match(
    # Target order
    paste0("XVIR1-", full_sce$donor_id),
    # Original data order
    sex_meta$ID
)]
# View a random few
rand_cells <- sample(1:ncol(full_sce), 6, replace = FALSE)
colData(full_sce)[rand_cells, c("donor_id", "sex")]
## DataFrame with 6 rows and 2 columns
##                 donor_id         sex
##              <character> <character>
## cell_470998         HS17           F
## cell_1549081         HS4           M
## cell_1879283         HS8           M
## cell_713282          HS2         N/A
## cell_1979640         HS9           M
## cell_272            HS11           M

Depending on the random seed, you may notice that some samples have a sex of “N/A” because that data was not known. We’ll deal with this later.

4.2 Make some plots with dittoSeq!

The exact set of plots that will be useful for any given data set depends heavily upon the experimental design and upon what sample metadata is actually available. Maybe you won’t need some of these plots, maybe you will. Our goal in this section is to give a relatively comprehensive overview of plots that be commonly useful. Feel free skip around… none of the code in this section makes changes to the underlying sce object, so feel free to only run code that seems useful for your own data.

dittoSeq offers plenty of plotting functions that are useful for cytometry data.

  • dittoDimPlot
  • dittoFreqPlot
  • dittoBarPlot
  • dittoScatterPlot
  • dittoPlot

Primary/Required Inputs: All these functions will take in a target ‘object’ (our down_sce for testing or full_sce when ready to make the real version), as well as one or more marker or metadata names to their ‘var’ and/or ‘group.by’, ‘color.by’, and ‘sample.by’. Those alone are minimally enough to make a plot.

Customize-ability: All of them also have plenty of quite useful tweaks and added functionality built in as well. We’ll go through a few examples, but we’d encourage you to check out dittoSeq’s own vignette and the functions’ own documentation (e.g. ?dittoDimPlot) to learn more!

4.2.1 The basic workflow

1. Create an SCE representing a subset of the data.

Because it can take multiple minutes to calculate and display even a single plot from millions of cells, we recommend creating a (further) downsampled version of your ‘full_sce’ so that visualizations can be tested out and tweaked more quickly. Here, we’ll give ourselves and object containing just 100k cells using cyclone’s make_or_load_downsample_sce helper function.

set.seed(42)
down_sce <- make_or_load_downsample_sce(
    "cytof_down_100k.Rds",
    full_sce,
    n_keep = 100000,
    save = FALSE)
## [ 2023-03-16 13:39:35 ] Creating downsampled SCE
## [ 2023-03-16 13:39:35 ] Done.
down_sce
## class: SingleCellExperiment 
## dim: 58 100000 
## metadata(0):
## assays(1): transformed
## rownames(58): Time length ... ID_1 ID_2
## rowData names(0):
## colnames(100000): cell_1109989 cell_54425 ... cell_555638 cell_1901363
## colData names(13): file_name pool_id ... fine_annot sex
## reducedDimNames(1): umap
## mainExpName: NULL
## altExpNames(0):

2. Test and tweak plots on the reduced sce object. (use ‘object = down_sce’)

3. Make final plots using the full sce object. (switch to using ‘object = fullsce’)

Note: For the final version of this vignette, we will be using the ‘full_sce’. To implement this suggested workflow for your own data, replace the ‘full_sce’ in code below with ‘down_sce’ for the initial views and tweaking stage, then switch back to ‘full_sce’ when producing final figures!

4.2.2 UMAP Plots with dittoDimPlot

Marker expression or cell metadata can be plotted on the UMAP using dittoDimPlot()

Some use cases:

  • Assessing distributions of sample features or experimental arms
  • Viewing locations of clusters in the umap space.

Primary inputs = object and var. (You can leave out those input names if you like, as long as you provide the ‘object’ first, and ‘var’ second.)

# CD3 Marker Expression
dittoDimPlot(object = full_sce, var = "CD3")

# Sample metadata example 1: sample groups or processing batch metric
dittoDimPlot(full_sce, "pool_id")

# Sample metadata example 2: coarse-level cell type annotations
dittoDimPlot(full_sce, "coarse_annot")

4.2.2.1 Some particlarly useful dittoDimPlot tweaks

# Label the color groups & remove the legend
dittoDimPlot(full_sce, "coarse_annot",
    # Add labels
    do.label = TRUE,
    # Remove the legend
    legend.show = FALSE)

# Adjust plot order, and also the title
dittoDimPlot(full_sce, "CD16",
    # Plot cells with higher expression in the front with 'order = "increasing"'
    #   Also try: "decreasing" or "randomize"
    order = "increasing",
    # Adjust the plot title
    main = "CD16 Expression")

# Only highlighting certain cells with 'cells.use'
dittoDimPlot(full_sce, "cluster",
    cells.use = full_sce$cluster==4)

## Faceting is VERY useful!
# Example 1: Simple recreation of the effect of the 'split_umap_by_clustering.png'
dittoDimPlot(down_sce, "cluster",
    # Create faceted plots where points are 'split' into different facets by a
    #   metadata given to 'split.by'.
    # Faceting can help make distribution differences more visible!
    split.by = "cluster")

# Example 2: Batch Correction Assessment
#   Ideally, every batches would have cells in all the gray regions of the umap
dittoDimPlot(full_sce, "pool_id",
    split.by = "pool_id")

4.2.3 BoxPlots of cluster or cell type frequencies per sample with dittoFreqPlot

Cell frequencies per-sample, grouped by one or more important metadata can be plotted using dittoFreqPlot()

Use case:

  • Visualize if/how cell frequencies are different across sample groups within your dataset

Primary inputs = object, var, sample.by, and group.by. (You can leave out those input names if you like, as long as you provide the ‘object’ first, and ‘var’ second.)

# Coarse-level frequencies per sample, between sexes
dittoFreqPlot(object = full_sce, var = "coarse_annot",
              sample.by = "donor_id", group.by = "sex")

# Cluster frequencies per sample, between sexes, with fine-level annotations in
#    facet labels.
dittoFreqPlot(object = full_sce,
              var = paste(full_sce$fine_annot, full_sce$cluster, sep = "__"),
              sample.by = "donor_id", group.by = "sex")

4.2.3.1 Some particlarly useful dittoFreqPlot tweaks

# Allow y-axis to shrink/stretch per data in each facet
dittoFreqPlot(full_sce, "coarse_annot", "file_name", group.by = "sex",
    split.adjust = list(scale="free_y"))

# Only show certain cell types
#   Here: all fine-level annotations in coarse-level "CD4T"
dittoFreqPlot(full_sce, "fine_annot", "file_name", group.by = "sex",
    vars.use = unique(full_sce$fine_annot[full_sce$coarse_annot=="CD4T"]))

# **Adjust percentage normalization to a restricted universe** & only show cell
#   types within that universe.
#   Here: all fine-level annotations in coarse-level "CD4T" cells
dittoFreqPlot(full_sce, "fine_annot", "file_name", group.by = "sex",
    # cells.use adjusts the 'universe' for percent calculation.
    cells.use = full_sce$coarse_annot=="CD4T",
    # vars.use limits cells types shown, but does not affect percent calculation
    vars.use = unique(full_sce$fine_annot[full_sce$coarse_annot=="CD4T"]))

# Coarse-level frequencies per sample, between sexes + also between batch
#   subgroup with color.by
# Also targeting just one cell type, for visibility in the same sized plot
#   For all cells, you'd want to make this plot quite large!
dittoFreqPlot(full_sce, "coarse_annot", "file_name",
    group.by = "pool_id",
    color.by = "sex",
    vars.use = c("CD8T", "NK", "B", "ASC"))

4.2.4 Stacked Bar Plots of cluster/batch/group composition with dittoBarPlot

Cell metadata composition per / grouped by any other cell metadata can be plotted with dittoBarPlot()

Some use cases:

  • Assessing equivalence of batch representation across clusters

Primary inputs = object, var, group.by. (You can leave out those input names if you like, as long as you provide the ‘object’ first, ‘var’ second, and ‘group.by’ third.)

# Sex (or any metadata) breakdown within each cluster.
dittoBarPlot(object = full_sce, var = "sex", group.by = "cluster")

# Cluster make-up per pool / processing batch
#   Ideally, you would see relatively similar distributions here.
dittoBarPlot(full_sce, "cluster", "pool_id")

4.2.4.1 Some particlarly useful dittoBarPlot tweaks

# factor-ized "cluster" metadata for this next example
full_sce$cluster_factor <- factor(
    full_sce$cluster,
    levels = 1:36
)
# Respect factor ordering using 'retain.factor.levels'
#   A 'flaw' relating to updates retaining backwards compatibility, the
#   developers plan to remove the need for this input in a future update.  But
#   as of the writing of this vignette, it is still needed.
dittoBarPlot(full_sce, "sex", group.by = "cluster_factor",
    retain.factor.levels = TRUE
    )

# Ignore certain samples
#   Perhaps, as for the toy dataset here, certain samples were only included as
#   a batch control, but do not have full information and not intended for
#   inclusion in down-stream biological questions.
dittoBarPlot(full_sce, "sex", group.by = "cluster",
    cells.use = !full_sce$donor_id %in% c("HS2")
    )

4.2.5 ScatterPlots, with all the same bells and whistles as umap plotting, dittoScatterPlot

Marker1 by marker2 expression-level scatter plots with dots (cells) optionally colored by metadata or marker3 expression

Some use cases:

  • Cluster annotation, sub-typing help. Ex: T cells CD45RA by CD27 or CCR7

Primary inputs = object, x.var, y.var (You can leave out those input names if you like, as long as you provide the ‘object’ first, ‘x.var’ second, and ‘y.var’ third.)

dittoScatterPlot(object = full_sce, x.var = "CD45RA", y.var = "CCR7")

dittoScatterPlot(full_sce, "CD45RA", "CCR7",
    color.var = "sex")

4.2.5.1 Some particlarly useful dittoScatterPlot tweaks

# Faceting (again, it's super useful!)
dittoScatterPlot(full_sce, "CD45RA", "CCR7", color.var = "cluster",
    split.by = "cluster")

# Setting titles & ...
#   The same 'main', 'sub', 'xlab', 'ylab' inputs can set titles across all
#   dittoSeq visualization functions (only exception is dittoHeatmap, the sole
#   non-ggplot function, where only 'main' can be used)
# focusing on only certain samples + cell types with cells.use in a case where perhaps the
#   control samples are particularly useful for deciding on meaningful value
#   cutoffs for the target markers.
dittoScatterPlot(full_sce, "CD45RA", "CCR7", color.var = "cluster",
    split.by = "cluster",
    cells.use = full_sce$control_sample & full_sce$coarse_annot == "CD4T")

4.2.6 violin (or bar, or ridge) plots of marker expression across groups of cells with dittoPlot

Plots where marker expression-level per cell are plotted as violin and/or box plots on a y-axes, or ridge plots in the x-axis direction.

Some use cases:

  • Cluster annotation, sub-typing help. Ex: T cells CD45RA, CD27, or CCR7 alone across clusters

Primary inputs = object, var, group.by (You can leave out those input names if you like, as long as you provide the ‘object’ first, ‘var’ second, and ‘group.by’ third.)

Very important additional input = plots, which sets the data representations to use. I defaults to c(“jitter”, “vlnplot”) which puts a violin in front of jittered points for the individual cells, but changing to c(“jitter”, “vlnplot”, “boxplot”) will add boxplots on top. Additionally, wrappers dittoBoxPlot, dittoRidgePlot, and dittoRidgeJitter automatically adjust the plots input default to c(“boxplot”, “jitter”), c(“ridgeplot”), and c(“ridgeplot”, “jitter”), respectively!

dittoPlot(object = full_sce, var = "CD45RA", group.by = "cluster")

# For better examples, we'll focus on clusters 1:8 with cells.use
dittoPlot(object = full_sce, var = "CD45RA", group.by = "cluster",
    cells.use = full_sce$cluster %in% 1:8)

# Add a boxplot
dittoPlot(full_sce, "CD45RA", "cluster",
    cells.use = full_sce$cluster %in% 1:8,
    plots = c("jitter", "vlnplot", "boxplot"))

# With boxplot, but no violin plots
dittoPlot(full_sce, "CD45RA", "cluster",
    cells.use = full_sce$cluster %in% 1:8,
    plots = c("jitter", "boxplot"))

4.2.6.1 Some particlarly useful dittoPlot tweaks

# Multiple genes in a single plot, by giving a set of markers to 'var'
dittoPlot(full_sce,
    var = c("CD45RA", "CD27", "CCR7"),
    group.by = "cluster",
    cells.use = full_sce$cluster %in% 1:8,
    # Facets will be used for the 'multivars'
    #   so we can set the faceting shape with split.ncol or split.nrow
    split.ncol = 1
    )

# Faceting or coloring as additional cell grouping mechanisms
dittoPlot(full_sce, "CD45RA",
    group.by = "sex",
    split.by = "coarse_annot")

dittoPlot(full_sce, "CD45RA",
    group.by = "coarse_annot",
    color.by = "sex",
    legend.title = "sex")

5 4. Run statistics on differences in cluster and cell type frequencies between samples.

Frequency comparison is a very common follow up in cytof data analaysis.

The file_by_cluster_freq_norm object output from cyclone can be used for calculating stats, but this is only directly useful at the cluster-level. Additional manipulation is required when wanting to calculate stats at the level of coarse or fine cell annotations, which generally combine multiple clusters together.

For this reason, it can be easier to piggy-back off of dittoSeq’s dittoFreqPlot data collection in order to gather cell frequency numbers in a more flexible way! So, we’ve put together a function for doing that that we’ll share here, and we’ll take you through using it below!

5.1 freq_stats

freq_stata() is a function included in the cyclone package which flexibly generates statistical comparisons between user-requested cell and sample groupings by making use of dittoFreqPlot()’s data collection system.

5.1.1 It’s inputs are:

Primary:

  • object: the SingleCellExperiment (or Seurat, but we’re not using that here) object to target. As with plotting, test with ‘down_sce’, but make sure to switch to ‘full_sce’ to make your final outputs!
  • group.by: String. The name of a metadata within ‘object’ that holds the condition information you wish to compare between.
  • group.1 & group.2: Single values of the ‘group.by’ metadata which name the groups to compare.
  • sample.by: String. The name of a metadata within ‘object’ that contains values which are unique for each sample. Typically, this can be “file_name” for cyclone outputs.
  • cell.by: String. The name of a metadata within ‘object’ that contains the cluster or cell annotation information you wish to target.

Secondary:

  • cell.targs: String vector, optional. If targeting just a subset of the cell clusters or annotations named in the ‘cell.by’ cell annotation metadata is desired, give that set of cell targets here.
  • cells.use: Logical vector, optional. If targeting only a certain timepoint/treatment/condition is required for your biological question, and your are using group.by for something else, give this input <object-name>$<condition-metadata-name> == '<target-condition>' OR <object-name>$<condition-metadata-name> %in% c('<target-condition1>', '<target-condition2>', '<target-condition3>')
  • data.out: Logical. FALSE by default. Setting it to TRUE will alter the output style to give a list of 2 elements: ‘stats’ = the standard output, and ‘data’ = the collected data.frame of cell counts and percentages used for the statistical calculations.

5.1.2 How it works, briefly:

The function does 4 things:

  1. Calls on dittoFreqPlot for data collection, frequency calculation, and normalization.
  2. Loops through each targeted cell type / annotation to calculates statistics by making us of R’s wilcox.test function. It’s a fitting statistical test for comparison between two groups that, unlike a t-test, does not assume the data has a normal distribution.
  3. Combines the outputs from all cell types into a single data.frame.
  4. Applies a False Discovery Rate p-value adjustment.

To give a better understanding of the inputs to the function, a better understanding of that first step can be helpful: The function first makes a call to dittoFreqPlot to gather and normalize cell count data. Within dittoFreqPlot, the number of cells of each ‘sample.by’ sample assigned to each distinct ‘cell.by’ value are gathered. Sample group information is also pulled in at this stage, and only samples whose cells are marked as ‘group.1’ and ‘group.2’ in the ‘group.by’ metadata are targeted for counting. The total number of cells for a given sample are then calculated (ignoring ‘cell.targs’ for now) and the counts data is then normalized as percentages of all cells for the given sample. Finally, if a set of cell names was given to ‘cell.targs’, this data.frame is trimmed to only retain rows representing those ‘cell_meta’ values. At this point, the data has been collected and normalized for the stats calculation steps.

5.2 Using freq_stats to compare between two groups

The frequency comparisons to run should be guided by your biological questions.

Was your study designed to assess differences between 2 groups? Then you want to target the metadata holding which cells belong to those 2 groups with group.by and give the names of those groups to group.1 and group.2.

Was your study designed to assess difference between 3 groups? This function performs just pairwise comparisons, so you’ll want to run it 3 times, targeting 1vs2, 1vs3, an 2vs3 to get the full picture of statistically significant differences. You likely have the group info in a single metadata, and if so you would keep group.by the same for all runs, but then adjust group.1 and group.2 for each of the distinct comparisons.

Another factor is what cell annotation level to target. Often, you’ll want to assess all the levels you have. Here, we have two annotation levels, “coarse_annot”, and “fine_annot”, but we also have the individual clusters level as well. So that makes 3 cell-levels that we would target here.

Finally, is the question of whether you want the ‘universe’ that cell percentages are normalized within to be “all cells of the given sample” versus “all cells of a certain coarse annotation type”. For example, you might be more interested in knowing the percentage of Treg cells out of CD4T cells, than in the percentage of Treg cells out of all cells of the given sample. This can be achieved with our function as well! Just skip down to the next section to see how!

Here We just have one bit of clinical information to compare between that has only two groups -> 1x pairwise group comparison. Plus, we have the “coarse_annot”, “fine_annot”, and “clusters” levels of cell annotation to compare between -> 3x cell type levels. That makes for 3 comparisons total.

How this plays out in inputs to the function:

  • group.by, group.1, group.2: In the data set for this tutorial, our metadata containing the sample grouping information is called ‘sex’, so we’ll use that for ‘group.by’. It’s values are “F”, representing female, and “M”, representing Male, so we’ll use ‘group.1 = “F”’ and ‘group.2 = “M”’.
  • sample.by: For any cyclone dataset, “file_name” can be used for this input.
  • cell.by: Our two levels of annotations are stored in “coarse_annot” and “fine_annot” metadata columns, and we can also calculate statistics at the original “cluster” level, but we’ll still want an idea of what those cells are, so we’ll create a metadata column that’s a combination of “fine_annot” and “cluster” below, and use that!
# Coarse level
freq_stats(
    object = full_sce,
    sample.by = "file_name",
    cell.by = "coarse_annot", 
    group.by = "sex", group.1 = "F", group.2 = "M")
##     cell_group comparison median_g1   median_g2 median_fold_change
## 1         CD8T     F_vs_M 0.2168043 0.205610000          1.0544445
## 2           NK     F_vs_M 0.0550800 0.066380657          0.8297598
## 3            B     F_vs_M 0.0458200 0.055990000          0.8183604
## 4          ASC     F_vs_M 0.0617000 0.033520670          1.8406553
## 5          gdT     F_vs_M 0.0119200 0.023830236          0.5002049
## 6          pDC     F_vs_M 0.0043200 0.004360044          0.9908158
## 7         CD4T     F_vs_M 0.4012000 0.322700000          1.2432600
## 8  neutrophils     F_vs_M 0.0059200 0.008490071          0.6972851
## 9    basophils     F_vs_M 0.0025200 0.002360026          1.0677847
## 10       cMono     F_vs_M 0.1216200 0.122500000          0.9928163
## 11      ncMono     F_vs_M 0.0111600 0.016650165          0.6702636
## 12        cDC1     F_vs_M 0.0142200 0.017800177          0.7988684
## 13        cDC2     F_vs_M 0.0083000 0.009820000          0.8452138
##    median_log2_fold_change           p      padj
## 1               0.07648318 0.437486303 0.6319247
## 2              -0.26923434 0.140065180 0.3641695
## 3              -0.28919172 0.085907678 0.3641695
## 4               0.88021949 0.131333339 0.3641695
## 5              -0.99940902 0.043941702 0.2856211
## 6              -0.01331125 0.591614303 0.7330794
## 7               0.31412803 0.008419404 0.1094523
## 8              -0.52017944 0.676688680 0.7330794
## 9               0.09462074 0.984174128 0.9841741
## 10             -0.01040125 0.676688680 0.7330794
## 11             -0.57719943 0.284039717 0.4764233
## 12             -0.32397015 0.222025353 0.4764233
## 13             -0.24261169 0.293183556 0.4764233
# Fine level
freq_stats(
    object = full_sce,
    sample.by = "file_name",
    cell.by = "fine_annot", 
    group.by = "sex", group.1 = "F", group.2 = "M")
##     cell_group comparison  median_g1   median_g2 median_fold_change
## 1    CD8T_EMRA     F_vs_M 0.01870037 0.030350589          0.6161453
## 2           NK     F_vs_M 0.05508000 0.066380657          0.8297598
## 3  plasmablast     F_vs_M 0.01774000 0.020390224          0.8700248
## 4          ASC     F_vs_M 0.06170000 0.033520670          1.8406553
## 5      CD8T_CM     F_vs_M 0.01460000 0.009570191          1.5255703
## 6      CD8T_EM     F_vs_M 0.07662000 0.089520000          0.8558981
## 7   gdT_CD8pos     F_vs_M 0.00478000 0.010650107          0.4488218
## 8          pDC     F_vs_M 0.00432000 0.004360044          0.9908158
## 9            B     F_vs_M 0.00898000 0.032140313          0.2793999
## 10     CD4T_EM     F_vs_M 0.13832000 0.107430000          1.2875361
## 11 neutrophils     F_vs_M 0.00592000 0.008490071          0.6972851
## 12   basophils     F_vs_M 0.00252000 0.002360026          1.0677847
## 13       cMono     F_vs_M 0.12162000 0.122500000          0.9928163
## 14  CD8T_naive     F_vs_M 0.09618000 0.059070599          1.6282212
## 15      gdT_DN     F_vs_M 0.00476000 0.013380267          0.3557478
## 16      ncMono     F_vs_M 0.01116000 0.016650165          0.6702636
## 17  CD4T_naive     F_vs_M 0.13820276 0.105820000          1.3060174
## 18       Tregs     F_vs_M 0.02338000 0.020950423          1.1159679
## 19        cDC1     F_vs_M 0.01422000 0.017800177          0.7988684
## 20     CD4T_CM     F_vs_M 0.08172000 0.060960000          1.3405512
## 21        cDC2     F_vs_M 0.00830000 0.009820000          0.8452138
##    median_log2_fold_change          p      padj
## 1              -0.69865740 0.85828854 0.9486347
## 2              -0.26923434 0.14006518 0.6481450
## 3              -0.20087160 0.98417413 0.9841741
## 4               0.88021949 0.13133334 0.6481450
## 5               0.60934869 0.85828854 0.9486347
## 6              -0.22448901 0.56434676 0.9473642
## 7              -1.15578535 0.22202535 0.6481450
## 8              -0.01331125 0.59161430 0.9473642
## 9              -1.83959664 0.03564774 0.6481450
## 10              0.36461285 0.17897317 0.6481450
## 11             -0.52017944 0.67668868 0.9473642
## 12              0.09462074 0.98417413 0.9841741
## 13             -0.01040125 0.67668868 0.9473642
## 14              0.70329668 0.30864047 0.6481450
## 15             -1.49107345 0.16604655 0.6481450
## 16             -0.57719943 0.28403972 0.6481450
## 17              0.38517415 0.79643411 0.9486347
## 18              0.15829557 0.79272872 0.9486347
## 19             -0.32397015 0.22202535 0.6481450
## 20              0.42282630 0.67668868 0.9473642
## 21             -0.24261169 0.29318356 0.6481450
### Cluster level, but with fine annotations + cluster names to make interpretation easier
# Create combined metadata
full_sce$fine_annot__cluster <- paste0(full_sce$fine_annot, "__cluster", full_sce$cluster)

# Run stats
freq_stats(
    object = full_sce,
    sample.by = "file_name",
    cell.by = "fine_annot__cluster", 
    group.by = "sex", group.1 = "F", group.2 = "M") # <-- Now using the metadata defined above
##                cell_group comparison   median_g1   median_g2 median_fold_change
## 1           ASC__cluster6     F_vs_M 0.061700000 0.033520670          1.8406553
## 2            B__cluster12     F_vs_M 0.008980000 0.032140313          0.2793999
## 3    basophils__cluster17     F_vs_M 0.002520000 0.002360026          1.0677847
## 4      CD4T_CM__cluster32     F_vs_M 0.033240000 0.039780000          0.8355958
## 5      CD4T_CM__cluster33     F_vs_M 0.038820000 0.022280223          1.7423524
## 6      CD4T_EM__cluster15     F_vs_M 0.015120000 0.015700156          0.9630478
## 7      CD4T_EM__cluster21     F_vs_M 0.033300000 0.039430000          0.8445346
## 8      CD4T_EM__cluster27     F_vs_M 0.036960000 0.050040968          0.7385948
## 9   CD4T_naive__cluster25     F_vs_M 0.091500000 0.054140000          1.6900628
## 10  CD4T_naive__cluster26     F_vs_M 0.041480000 0.058390607          0.7103882
## 11      CD8T_CM__cluster7     F_vs_M 0.014600000 0.009570191          1.5255703
## 12     CD8T_EM__cluster13     F_vs_M 0.033880000 0.029030000          1.1670685
## 13     CD8T_EM__cluster14     F_vs_M 0.016660000 0.024360252          0.6839010
## 14      CD8T_EM__cluster8     F_vs_M 0.007380000 0.030270296          0.2438034
## 15    CD8T_EMRA__cluster1     F_vs_M 0.006820136 0.006510000          1.0476400
## 16    CD8T_EMRA__cluster2     F_vs_M 0.013280000 0.014370153          0.9241377
## 17  CD8T_naive__cluster19     F_vs_M 0.086700000 0.046700000          1.8565310
## 18  CD8T_naive__cluster22     F_vs_M 0.009300000 0.009910097          0.9384368
## 19        cDC1__cluster30     F_vs_M 0.014220000 0.017800177          0.7988684
## 20        cDC2__cluster36     F_vs_M 0.008300000 0.009820000          0.8452138
## 21       cMono__cluster18     F_vs_M 0.020960000 0.032510317          0.6447184
## 22       cMono__cluster24     F_vs_M 0.038420000 0.058050000          0.6618432
## 23       cMono__cluster29     F_vs_M 0.005260000 0.004740000          1.1097046
## 24       cMono__cluster35     F_vs_M 0.046320000 0.031340000          1.4779834
## 25   gdT_CD8pos__cluster9     F_vs_M 0.004780000 0.010650107          0.4488218
## 26      gdT_DN__cluster20     F_vs_M 0.004760000 0.013380267          0.3557478
## 27      ncMono__cluster23     F_vs_M 0.011160000 0.016650165          0.6702636
## 28 neutrophils__cluster16     F_vs_M 0.002620000 0.005310156          0.4933941
## 29 neutrophils__cluster34     F_vs_M 0.003700000 0.003450000          1.0724638
## 30          NK__cluster10     F_vs_M 0.005600000 0.007830000          0.7151980
## 31           NK__cluster3     F_vs_M 0.009800000 0.034120323          0.2872189
## 32           NK__cluster4     F_vs_M 0.025040000 0.027980273          0.8949162
## 33         pDC__cluster11     F_vs_M 0.004320000 0.004360044          0.9908158
## 34  plasmablast__cluster5     F_vs_M 0.017740000 0.020390224          0.8700248
## 35       Tregs__cluster28     F_vs_M 0.012780000 0.013520405          0.9452379
## 36       Tregs__cluster31     F_vs_M 0.011700000 0.006600000          1.7727273
##    median_log2_fold_change          p      padj
## 1               0.88021949 0.13133334 0.4298182
## 2              -1.83959664 0.03564774 0.4298182
## 3               0.09462074 0.98417413 0.9841741
## 4              -0.25912289 0.30864047 0.5050480
## 5               0.80103647 0.22202535 0.4440507
## 6              -0.05432072 0.85828854 0.9655746
## 7              -0.24377153 0.95254862 0.9841741
## 8              -0.43714495 0.06519038 0.4298182
## 9               0.75707686 0.19260224 0.4440507
## 10             -0.49332041 0.12105748 0.4298182
## 11              0.60934869 0.85828854 0.9655746
## 12              0.22288930 0.82723337 0.9655746
## 13             -0.54814068 0.09384527 0.4298182
## 14             -2.03621008 0.07850178 0.4298182
## 15              0.06714305 0.98417413 0.9841741
## 16             -0.11382030 0.73113144 0.9400261
## 17              0.89260944 0.41416497 0.5734592
## 18             -0.09166846 0.32823925 0.5137658
## 19             -0.32397015 0.22202535 0.4440507
## 20             -0.24261169 0.29318356 0.5026695
## 21             -0.63325891 0.13133334 0.4298182
## 22             -0.59543855 0.12105748 0.4298182
## 23              0.15017574 0.19260224 0.4440507
## 24              0.56363007 0.34859862 0.5228979
## 25             -1.15578535 0.22202535 0.4440507
## 26             -1.49107345 0.16604655 0.4440507
## 27             -0.57719943 0.28403972 0.5026695
## 28             -1.01918749 0.37380340 0.5382769
## 29              0.10092891 0.29322390 0.5026695
## 30             -0.48358548 0.11139863 0.4298182
## 31             -1.79977766 0.05923987 0.4298182
## 32             -0.16017547 0.85564793 0.9655746
## 33             -0.01331125 0.59161430 0.7888191
## 34             -0.20087160 0.98417413 0.9841741
## 35             -0.08125055 0.20694986 0.4440507
## 36              0.82597060 0.08212935 0.4298182

Of course, it’s also helpful to have a visual. We’ve described dittoFreqPlot in a previous section. Here’s how we might use it for the cluster-level:

# Visualization of frequency differences at the cluster-level
dittoFreqPlot(
    object = full_sce,
    var = "fine_annot__cluster",
    group.by = "sex",
    sample.by = "file_name",
    # Optionally target only group.1 and group.2 with:
    #   (modify "sex", "F", and "M" for your own data)
    cells.use = full_sce$sex %in% c("F", "M"),
    # Allow y-axis to shrink/expand per range of each cell type
    split.adjust = list(scale = "free_y")
)

5.3 Using freq_stats to compare between two groups

One might also wish to assess cell frequency in terms of “fine” annotation per “coarse” annotation. For example, you might be more interested in knowing the percentage of Treg cells out of CD4T cells, than in the percentage of Treg cells out of all cells of the given sample. Luckily, to achieve such metrics, simply subset the SCE first!

# Create a subset of our data that is only the cells labeled as CD4T in coarse_annot
full_sce_CD4 <- full_sce[, full_sce$coarse_annot=="CD4T"]

# Calculate fine-level statistics within these CD4T cells
freq_stats(
    object = full_sce_CD4,
    sample.by = "file_name",
    cell.by = "fine_annot", 
    group.by = "sex", group.1 = "M", group.2 = "F",
    # Explicitly targeting just the values of 'fine_annot' here that have a 'coarse_annot' of CD4T 
    cell.targs = c("CD4T_EM", "CD4T_naive", "Tregs", "CD4T_CM")
)
##   cell_group comparison  median_g1  median_g2 median_fold_change
## 1    CD4T_EM     M_vs_F 0.32712653 0.34146341          0.9580134
## 2 CD4T_naive     M_vs_F 0.36019454 0.38806264          0.9281866
## 3      Tregs     M_vs_F 0.06723778 0.06680353          1.0065003
## 4    CD4T_CM     M_vs_F 0.21324079 0.21663548          0.9843300
##   median_log2_fold_change         p      padj
## 1            -0.061882246 0.6194476 0.7659395
## 2            -0.107513236 0.7659395 0.7659395
## 3             0.009347672 0.3282393 0.7659395
## 4            -0.022786070 0.6194476 0.7659395
# And the visualization
dittoFreqPlot(
    object = full_sce_CD4,
    var = "fine_annot",
    group.by = "sex",
    sample.by = "file_name",
    # Optionally target only group.1 and group.2 with:
    #   (modify "sex", "F", and "M" for your own data)
    cells.use = full_sce_CD4$sex %in% c("F", "M"),
    # vars.use is the cell.targs equivalent in the dittoFreqPlot function
    vars.use = c("CD4T_EM", "CD4T_naive", "Tregs", "CD4T_CM"),
    # Allow y-axis to shrink/expand per range of each cell type
    split.adjust = list(scale = "free_y"),
    # Update the y-axis label to mention the adjusted universe
    ylab = "Percent of CD4T cells"
)